Pragmatic Chinese Lexical Analysis Based on Word-character Hybrid Model

نویسندگان

  • Xiao Sun
  • Degen Huang
چکیده

In the field of information and natural language processing, Chinese lexical analysis is important basic step for Chinese, Japanese or other asian language. This paper presents Chinese lexical analysis integrating word-level and character-level information based on hybrid model combining word-based CRF model and latent semi-CRF model. The word-lattice, which represents all candidate outputs, is built by utilizing the system lexicon. The linear-chain CRF is applied in the selection of final token sequence from word-lattice by using rich and flexible predefined features. Latent semi-CRF model is adopted in unknown word processing, which is character-based and invoked when no matching word can be found in a lexicon for building the lattice. This pragmatic method based on hybrid CRFs models offers a solution to the long-standing problems in corpus-based or statistical, word-based or character-based Chinese lexical analysis. First, flexible feature designs for hierarchical tag sets become possible. Second, influences of label and length bias are minimized. Third, the word-level information for the known words and the character-level information for the unknown words can be combined and fully used.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

HMM and CRF Based Hybrid Model for Chinese Lexical Analysis

This paper presents the Chinese lexical analysis systems developed by Natural Language Processing Laboratory at Dalian University of Technology, which were evaluated in the 4th International Chinese Language Processing Bakeoff. The HMM and CRF hybrid model, which combines character-based model with word-based model in a directed graph, is adopted in system developing. Both the closed and open t...

متن کامل

A Hybrid Machine Translation System Based on a Monotone Decoder

In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...

متن کامل

A Hybrid Word-Character Model for Abstractive Summarization

Abstractive summarization is the popular research topic nowadays. Due to the difference in language property, Chinese summarization also gains lots of attention. Most of studies use character-based representation instead of word-based to keep out the error introduced by word segmentation and OOV problem. However, we believe that word-based representation can capture the semantics of the article...

متن کامل

A Chinese LPCFG Parser with Hybrid Character Information

We present a new probabilistic model based on the lexical PCFG model, which can easily utilize the Chinese character information to solve the lexical information sparseness in lexical PCFG model. We discuss in particular some important features that can improve the parsing performance, and describe the strategy of modifying original label structure to reduce the label ambiguities. Final experim...

متن کامل

Mandarin Word-Character Hybrid-Input Neural Network Language Model

We applied neural network language model (NNLM) on Chinese by training and testing it on 2011 GALE Mandarin evaluation task. Exploiting the fact that there are no word boundaries in written Chinese, we trained various NNLMs using either word, character, or both, including a wordcharacter hybrid-input NNLM which accepts both word and character as input. Our best result showed up to 0.6% absolute...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010